LLM Evaluation

Method

Use a dataset to evaluate LLMs. The dataset consists of math problems.

Paper: DeepSeek-R1

Problem

How many of the problems has the LLM memorized?

Paper: GSM-Symbolic
Paper: Premise Order Matters in Reasoning with Large Language Models

If we change the dataset, is that good enough?

Paper: ARC-AGI

Conclusion - Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure."
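One way to probe the memorization question is to regenerate a benchmark problem from a template with fresh names and numbers, then compare accuracy on the original item against accuracy on the variants. The sketch below is only in the spirit of GSM-Symbolic and is not the paper's code. The template, the name list, and `memorizing_model` (a toy stand-in for an LLM that recalls a single benchmark item verbatim) are all hypothetical, introduced for illustration.

```python
import random

# Hypothetical template: the name and the two numbers are slots that can be re-sampled.
TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have now?"
NAMES = ["Ava", "Ben", "Chen", "Dana"]


def make_variant(rng):
    """Fill the template with fresh values and compute the ground-truth answer."""
    name = rng.choice(NAMES)
    a = rng.randint(2, 50)
    b = rng.randint(2, 50)
    return TEMPLATE.format(name=name, a=a, b=b), a + b


def accuracy(model, problems):
    """Exact-match accuracy of model(question) against the reference answers."""
    correct = sum(1 for question, answer in problems if model(question) == answer)
    return correct / len(problems)


# The "published" benchmark item that a model could have seen during training.
ORIGINAL = (TEMPLATE.format(name="Sam", a=3, b=4), 7)


def memorizing_model(question):
    """Toy stand-in for an LLM: answers only the benchmark item it has seen verbatim."""
    return 7 if question == ORIGINAL[0] else None


rng = random.Random(0)  # fixed seed so the variants are reproducible
variants = [make_variant(rng) for _ in range(20)]

original_acc = accuracy(memorizing_model, [ORIGINAL])
variant_acc = accuracy(memorizing_model, variants)
print(f"original: {original_acc:.2f}, variants: {variant_acc:.2f}")
```

A large gap between the two numbers suggests the score on the original dataset reflects recall rather than reasoning. To use this with a real model, replace `memorizing_model` with a function that sends the question to the model and parses its final numeric answer.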